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Abstract 

This paper describes a new efficient conjugate subgradient algorithm which minimizes a convex 
function containing a least squares hdelity term and an absolute value regularization term. This 
method is successfully applied to the inversion of ill-conditioned linear problems, in particular for 
computed tomography with the dictionary learning method. 

A comparison with other state-of-art methods shows a significant reduction of the number of 
iterations, which makes this algorithm appealing for practical use. 
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I. INTRODUCTION 


Almost every field of science has, at some point, to tackle the linear inverse problem 
characterized by a matrix A. In this problem, the observations vector b can be expressed as 


b = Ax + e 


(hi) 


where x is the unknown signal to recover, A the process matrix, and e is some unknown 
noise. 

The common Bayesian approach is to model the noise e as an zero-mean Gaussian process 
of variance Id, and the unknown variable x as another random process. If the signal x 
is theoretically given, then the quantity Ax is deterministic; thus b = Ax -|- e is a random 
process. More precisely, since e A/'(0,cr^), then b ~ M {b — Ax,a‘^). The likelihood 
function of b is then given by 


p{b I x) 



\\b-^\ 

2a2 J 


( 1 . 2 ) 


where ||tc ||2 = w'^w is the squared Frobenius norm of w, that is, the sum of the squared 
components. Now since x is unknown, the Bayesian approach consists in modeling it as 
another ramdom process. If the signal is sparse in the representation Dx, where the columns 
of D are the vectors of the basis, we can approximate the a-priori probability of x using a 
Laplacian distribution, which implements the sparsity inducing Li norm [1] : 

p(a;) = ^exp(-/3 IlDxllJ (1.3) 


where ||tc||^ is the Li norm of w, that is, the sum of the components absolute values. The 
posterior probability p{x \ b) conditional on the observation vector b then reads 

Pi^ I ~ P(^ I ^)P(^) ~ ^ j (1.4) 

Eventually, the Maximum A Posteriori (MAP) approach amounts to minimizing the log- 


2 






Likelihood 


( 1 . 5 ) 


^'■r.\h) = i \\Ax - 5||j + /3||Da:||; 

which is the least squares formulation of (I.l) with a LI norm regularization. Notice that this 
penalty comes from the assumption made on the distribution of the values of x. Assuming 
normally distributed values of x would have led to the Tikhonov regularization [2] (L2 norm). 
L2-L1 minimization naturally arises in numerous applications when it comes to determine 
a solution with sparsity constraints. In signal processing, one can cite deconvolution, image 
zooming, image inpainting, motion estimation [3] and even tomographic reconstruction [4]. 
Generally speaking, L2-L1 is a special instance of the minimization problem 


argmin {F{x) = f{x) + 5 f(x)} (1.6) 

X 

where F is purposely split into a convex, smooth part /, and a convex, possibly non-smooth 
part g. This formulation is widely used for proximal splitting methods [5], which rely on the 
computation of the so-called proximal operator 


prox (x) = (Id-l-cl^f) ^ (x) = argmin - ||x — yW^ + 5 '(^) 


(1.7) 


where dg is the subdifferential of g ; 


dg(x) ^ {d\^y, tf(y-x)<g(y)-g{x)}. 


( 1 , 8 ) 


The sub differential is set-valued where g is not differentiable, and single-valued otherwise. 
For example, we have 9 H-H^ (x) = sign (x) if x 7 ^ 0, and 9 H-H^ (0) = [—1,1]. 

The case of L2-L1 minimization is a special instance of (1.6), where 


fix) = \ \\Ax-b\\l 
g{x) = (5 ||T>x||^ 


(1.9) 


An alternative formulation to (1.5) is the synthesis formulation 


argmin |||AiLw — 6 II 2 - 1 -/3 ||w||^} 

W 


(I.IO) 
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and is celebrated as the least absolute shrinkage and selection operator (LASSO) [6], while 
(1.5) implements, at variance with (1.10), an analysis approach. 

The formulation (1.5) corresponds to a linear inverse problem where Dx is constrained 
to be sparse. An example is the Total Variation regularization [7]; D = ||Vx||^. In the 
formulation (1.10), the solution x = Hw is synthesized from the coefficients w; these coeffi¬ 
cients are constrained to be sparse in some domain. An example is the Wavelet denoising 
for A = Id. These two approaches are equivalent if D is an orthonormal transform (and 
then H = D* the hermitian conjugate of D) [8]. However, in most cases, the theory and 
algorithms are more difficult in the analysis formulation. In proximal splitting methods, the 
computation of prox^ is straightforward in the formulation (1.10) {g = IHli), but not trivial 
in the formulation (1.5) {g = ||T)-||^). 

An alternative to proximal splitting methods is to adapt the functional F in (1.6) in 
order to use fast optimization algorithms like Newton or conjugate gradient. It usually boils 
down to smoothing the regularization term g{x). However, such approaches converge to an 
approximate solution of (1.6), which can be an issue if high sparsity constraint should be 
met. 

We present in this work an algorithm, based on a new conjugate sub-gradient method 
optimized for LASSO minimization. In the next section, after a brief recall of the conjugate 
gradient algorithm, we derive our algorithm. Section HI illustrates the applications with 
numerical examples : one for a very ill-conditioned matrix, and another for tomographic 
reconstruction with the dictionary-learning regularization. The convergence of this conjugate 
subgradient algorithm is compared to to the more general Nesterov [9] method. 

II. A CONJUGATE SUBGRADIENT ALGORITHM 

A. The nonlinear conjugate gradient algorithm 

In this section, we settle the notations by recalling the standard conjugate gradient algo¬ 
rithm. 

Let X denote the (vector) variable of the function F. For the remainder of this paper, the 
functional to minimize is F{x) = f{x) + g{x) with f{x) = \ \\Ax — bWl and g{x) = (5 ||a;||;^. 
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so the optimization problem is 


argmin |F(a;) = ^ ||Aa: — 6 II 2 +/3 ||a;||;^| (H-l) 

The conjugate gradient algorithm builds a set of conjugate directions {pk)k=i n where n 
is the number of iterations. Once the conjugate direction pk at iteration k, the variable is 
updated with x^+i = Xk + akPk- The scalar ak is the step size at iteration k, computed with 
a line search. The gradient of F is then evaluated in Xk+i to compute the next conjugate 
direction Pk+i- The computation of Pk+i actually only depends on the previous direction, 
which makes the conjugate gradient algorithm practically usable. 

For a differentiable function F, the standard conjugate gradient is given by Algorithm 1. 


Algorithm 1 Conjugate gradient 


F 

: differentiable function 


n 

number of iterations 


1 

procedure CONjGrad(F, n) 


2 

Compute an initial guess Xq 


3 

go = -VF{xo) 

> Steepest direction at iteration 0 

4 

0 

0 


5 

for A; 0 , n do 


6 

Ofc = argmin {F{xk + apk)} 

> Line search 

7 

“1“ ^kPk 

> Update variable 

8 

gk+i = -VF{xk+i) 

> Update Steepest direction 

9 

j3k = > Update (3, for example with the Polak-Ribiere rule 

9i 9k 

10 

Pk+l — 9k+l + PkPk 

> New conjugate direction 

11 

end for 


12 

return Xn 


13 

end procedure 
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B. Prom conjugate gradient to conjugate subgradient 


In the basic subgradient method 


Xt+I = Xt- 'itPt Pk 6 dF{Xk) 


( 112 ) 


the direction pk is any subgradient dF{xk), which is a drawback of this method since there 
is no indication of which subgradient should be chosen. As a result, the conjugate subgra¬ 
dient is not a descent method; the objective function can increase during the optimization 
process [10]. 

To build an algorithm based on the conjugate gradient, one has to dehne an unique descent 
direction at each iteration, which means choosing between all the possible subgradients dF 
when F is not differentiable. 

The basic idea is to rely on the quadratic part V/ of the gradient. Once the gradient of 
the smooth part V f{x) is calculated, the subgradient of the LI part g is evaluated with : 


dg{x) 


{ sign (x) if X 7 ^ 0 

sign(V/(x)) if X = 0 


( 11 . 3 ) 


Notice that using (II.3), the subderivative of F = / -|- is always single-valued. The 
motivation of such a choice is that when the variable x comes near the singularity of g = || ■ || ^, 
every direction (subgradient) is possible. The idea is then to go in the same direction than 
the quadratic term is “pushing" to. 

The use of (II.3) to compute the subgradient enables to solve the indecision of which 
subgradient should be chosen, and makes possible the construction of a conjugate directions 
basis. The standard Polak-Ribiere method can be used to update the conjugate direction 
from the previous directions. 

A crucial point for the convergence rate is the use of a preconditioner. In our method, 
the preconditioner relies on the magnitude of the quadratic part of the gradient V/. 

From the variables x^+i, pk, Qk (see Algorithm 1), three new preconditioned variables 
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Xfc+i, Pfc+ 1 , qk+i are built with the following preconditioner : 


D = 


1 if |V/(Mfc © Xk+i)\ < (3 and Xk ■ Xk+i < 0 


0 otherwise 
Mfc+i = min {Mu ■ {1 - 'jD + - D )), 1) 

0 if |V/(Mfc 0 Xfc+i)! < /3 and |a;| < £ 

1 otherwise 
Mk+i 


Sk+i — 


Vk+i = 


Mk 


(11.4) 


^ — _ ^k+l Q 

^k+1 7T ■ 

Vk+l 

' PM=pi^-yM-SM 

Qk+i = Qk ■ 14+1 ■ Sk+i 

all the operation being componentwise except for the argument of / which is obtained with 
the componentwise multiplication 0 between the vector Xk+i and the vector of precondi¬ 
tioning multiplying factors. 

The rationale of this preconditioner can be summarized as follow ; 


• When the gradient magnitude of the quadratic part V/ is important, the components 
of the variables are updated as in the conjugate gradient method - without variable 
substitution - since the quadratic part is predominant over the non-smooth part. 

• When |V/| is small, the standard conjugate gradient method would be disturbed by 
frequent crossings of regions where the gradient of g is discontinuous. The rule used 
is that the preconditioning factors are increasingly shrunk by a factor 7 < 1 as long 
as they should be updated. The criterion is to check if the previous preconditioned 
variable {xk) and the variable updated after the line search {xk+i) have an opposite 
sign. This variable substitution is implemented by the coefficient vector Mk- 

• The exponent a, used in the determination of the vector Pk+i is a tunable number. 
The vector Pk+i is used in the composition of the Pk+i descent direction(see Algorithm 
1). By using a number a > — 1 we tend to avoid constructing descent directions which 
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bring us too fast to non-smooth regions. Keeping a = — 1 corresponds to using the 
previous descend direction as in standard conjugate gradient method. 

• Another rule is that during this phase (small quadratic gradient), the components 
which are “small enough" (below a threshold e) are set - and will remain as long as 
the force on them is weak- to zero. This rule is especially important for the convergence 
toward solutions with high sparsity. This rule is implemented by the matrix Sk- 

The conjugate subgradient algorithm for LASSO optimization is given by Algorithm 2 
Algorithm 2 Conjugate subgradient 

F : function to optimize, F{x) = f{x) + g{x) with / the quadratic part and g the LI part 
7 , (5, e ; parameters for update the preconditioner (see (11.4)) 
n : number of iterations 

1: procedure CONjSubGrad(F, ( 7 , 5, e), n) 

2: Compute an initial guess Xq 

3 : g^ = —VF{xq) > Steepest direction at iteration 0 

4: Po = go 

5 : Mq = 1 > Element-wise 

6: for /c ^ 0, n do 

7 : q^ = MkQ A^A{Mk © Pk) 

8 : Compute = argmin {F{Mk 0 {xk + «Pfc))} 

OL 

9 : ^kPk 

10 : Update preconditioners 5 a;+i, V^+i) using (IL4) 

11 : Update {xk+u Pk+i, Qk+i) using (11.5) 

12: 9k+i = F(xk+i 0 Mk+i) © Sk+1 © Mk+i 

13 : ^ = -^±1^ 

Qk+lPk+l 

14 : Pk+I = gk +1 + f3pk+i 

15 : end for 

16 : return Xn 

17 : end procedure 










C. Line search 


The line search is a crucial step of gradient methods. The variables are updated with the 
previously computed conjugate direction The step in this direction should be such as 

ak = argmin {F{Mk+i © Xk+i)} with Xk+i = Xk + apk (II.6) 

a 

The computation of (11.6) can be done “blindly" with a generic line search, but here one 
can benefit from both the quadratic nature of / and the convex property of g. We discuss 
how to do it in this session, discarding for conciseness, and without loss of generality, the 
notation of preconditioner vector M. 

Regarding the quadratic part /, it is easily shown that 

f{xk + apk) = ^ ||-4(a;A: + apk) - h\\\ = 020 ^ + aia + oq (II.7) 

with 02 = ^plA^Apk, oi = pIA^ {Axk - b), ao = ^ {x^A^Axk + h^h) 

The coefficients 02 and Oi can be computed once for all before the line search ; actually, 
they are also used elsewhere in the algorithm so they have to be computed anyway. The 
evaluation of the derivative of / with respect to the scalar a, only requires these two 
coefficients, and thus has virtually no cost. 

Another interesting property of smooth quadratic function f{x) = \\Ax — &II 2 is 

V/(a:A:+i) = Vf{xk) + akA^Apk (II.8) 

The quantity A'^Apk is also reused, for example with the computation of p'^A'^Apk- Hence 
the update of the gradient Vf{xk+i) from the previous gradient V f{xk) is cheap. 

For a smooth quadratic function, the line search is straightforward: 

df r d 

^ ^ da ^ ' da^^"^^ 

= Pk (V/(a;fc) + aA^Apk) using (II.8) 
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which gives 


ttfe — 


plA^Apk 


(11.9) 


Now, getting back to the whole function F = f + g, a. one-step line search like (11.9) is 
not possible since one cannot extract a from dg{xk+i)- However, due to the convexity of g^ 
an upper bound of can be computed using the following property : 

Property 1. For all k, we have p1dg{xk+i) > P^dg{xk)- 

Proof. Since g is convex, every component dg'^ of its subgradient is increasing. Thus, we 
have dg^Xk+iY > dg{xkY if and only if x\j^^ > i.e > 0 (since > 0). Thus : 

• If pI > 0, then = x^ akp\. > x^, so ^^(xfc+i)* > dg{xk)\ so pi ■ dg{xk+i) > 

Pk ■ dg{xk). 

• Similarly, if pi < 0, then dg^Xk+iY < dg{xkY so pi ■ dg{xk+iY > pI ■ dg{xkY- 

Doing the scalar product, we have in any case pfdg^Xk+i) > Pkdg{xk) □ 

Using this property, we can derive the same calculation as for (II.9) : 

dF d/ dg 

0 = — = ^ ^ 

da da da 

= pl^fM + aplA^Apk + pldg{xk+i) (^^-^ 0 ) 

> pI (V/(xfc) + dg{xk)) + aplA^Apk 


Thus 


Ofc < ttfc = 


-pldF{xk) 


( 11 . 11 ) 


pfA^Apk 

For the last inequality in (II. 10), property 1 has been applied. The upper bound a^ is 
convenient for a line search using the bisection method. For example, the line search can 
be done using the regula falsi method at the beginning when the differentiable L2 part is 
predominant, and then the bisection method when the LI part becomes more important. 


III. APPLICATIONS 

In this section, numerical examples are provided to compare the convergence of this new 
method with Nesterov algorithm [9], also known as FISTA [11] which is a state-of-art convex 
non-smooth optimization method. 
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A. Example on ill-conditioned matrix 


This example illustrates the convergence rate of the conjugate subgradient algorithm for 
problem (II.l), where the matrix A is chosen to be ill-conditioned. The code to compute 
this example can be found at [12] In this example, A is a 1000 x 1000 symmetric matrix, 
with a condition number k = , gf^n-i 4 — 5.93 ■ 10^®. The eigenvalues of A are 

plotted on Figure III.l. 



n 

Figure III.l: Logarithmic plot of the eigenvalues of the matrix A 

The algorithm was run with the parameters 7 = 0.85 and 6 = 0.04, the regularization 
parameter was f3 = 0.1 and the exponent for direction p was a = 1. Figure III.2 shows the 
objective function values F{x) — F{xoo) for 2000 iterations for the two methods. It can be 
seen that CSG achieves the solution in about 800 iterations, while FISTA needs much more 
iterations to converge. Also, the objective function values are always smaller for CSG. 
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Figure III.2: Logarithmic plot of objective function values for CSG and Nesterov algorithm 


B. Tomographic reconstruction with the dictionary-learning regularization and 
ring-artifacts correction 

Tomographic reconstruction is another example of linear inverse problem. In the last 
years, an increasing interest was shown for iterative techniques with regularization, which 
can be seen as an extension of the standard Algebraic Reconstruction Technique and Simul¬ 
taneous Iterative Reconstruction Technique. These techniques bring many opportunities, 
for example modeling more accurately the process, incorporating a priori knowledge on 
the volume and correcting artifacts. A prominent application is the low-dose tomography 
reconstruction. 

Iterative tomographic reconstruction amounts to an optimization problem An example is 
the the total variation reconstruction 

argmin {||Pa; — (i ||2 +/d ||Va;||j^} (IH-l) 

X 

which penalizes the nonzero components of the gradient of the slice, promoting piecewise 
constant results. Here x denotes the slice (or volume) to be reconstructed, P is the projection 
operator, d is the acquired sinogram and /d is a factor weighting the sparsity of the gradient 
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of the solution. Another example is the dictionary learning reconstruction 

argmin [\\PDw — d\\l + f3 \\w\\^] (III-2) 

W 

which promotes the sparsity of the slice in an appropriate basis D : either a learned dictio¬ 
nary [13] or a Wavelet transform. 

Notice that (Uhl) correspond to an analysis formulation while (HI.2) is a synthesis 
formulation, for which the conjugate subgradient can be applied. 

In this example, the standard 512 x 512 test image Lena was used. According to the 
Nyquist criterion, |512 ~ 800 projections would be required to get an appropriate recon¬ 
struction quality with the Filtered Back Projection. With iterative techniques promoting 
sparsity, this number can be dramatically decreased according to the Compressive Sensing 
theory [14]. Here only 80 projections were used to demonstrate the abilities of the Dictio¬ 
nary Learning technique. Additionally, rings artifacts were simulated by adding lines in the 
sinogram. The lines values are not constant along the projection angle, which makes the 
problem more challenging. To take the rings correction into account [15] , the reconstruction 
problem is written as (III.3). 

argmin If{w) = ||PDtc -|- 1 x — d\\^^ + (3 ]|tc]|^ -|- jdr (III.3) 

In this formalism, a ring vector r is added to each projection line of the sinogram - the 
rings artifacts are modeled as constant values along the projection angle in the sinogram. 
The sinogram has dimensionality {Np, N) where Np is the number of projections and N is the 
number of pixels in one dimension of the slice. The operation 1 x consists in multiplying 
a {Np, 1) vector of ones with a (1, N) vector r. 

The functional (III.3) was minimized with two techniques implemented in the PyHST2 
code [16] ; Nesterov algorithm (FISTA) and this conjugate subgradient algorithm (CSG). In 
this test, an over-complete dictionary has been used, resulting in an ill-conditioned problem 
which is a difficult test case for optimization algorithms. Moreover we observed that, for 
this kind of problem, the transfer of energy from the reconstructed image to the auxiliary 
variables capturing the spurious artifacts (r) occurs in the hnal part of the convergence and 
is slow with the FISTA. The best convergency properties were obtained with a = 0. 

Figure III.3 shows the plot of the normalized objective function F{w) — F{woo) for 8000 
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Figure III.3: Logarithmic and linear plots of the values of the objective function for both methods 

iterations. Both methods converge to the same hnal valne since the same fnnctional F{w) 
is minimized, bnt the last stage of the optimization process is mnch faster for the conjngate 
snbgradient algorithm. Fignre III.4 shows the reconstrncted images with Filtered Back 
Projection and the Dictionary Learning techniqne, for parameters /3 = 0.7 and (3^ = 10. 
It can be noted that the rings artifacts are almost entirely removed, even with the simple 
“constant rings" modeling. 
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(c) Dictionary Learning 


Figure III.4: Phantom of Lena reconstructed with 80 projection angles. Lines were added to the 
sinogram to simulate ring artifacts. 


IV. CONCLUSIONS 

We have presented a specialized Conjugate Sub Gradient method which we have tailored 
for the LASSO minimization. This method is £t to cope at the same time with the ill- 
conditioning of the LASSO matrix and the discontinuities in the hrst derivative. We have 
tested our method on two difficult cases and found excellent acceleration, outperforming 
state-of-the art algorithms. An implementation of CSG can be found at [12]. 
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